Hadoop on Docker

Applies to Hadoop 3.3.0.

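1 Overview

The Linux Container Executor (LCE) allows the YARN NodeManager to launch YARN containers either directly on the host machine or inside Docker containers.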

2 Cluster Configuration

To avoid timeouts when launching jobs, pull large images into the Docker cache in advance:

sudo docker pull library/openjdk:8

Settings in yarn-site.xml:

<configuration>
  <property>
    <name>yarn.nodemanager.container-executor.class</name>
    <value>org.apache.hadoop.yarn.server.nodemanager.LinuxContainerExecutor</value>
    <description>
      This is the container executor setting that ensures that all applications
      are started with the LinuxContainerExecutor.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.linux-container-executor.group</name>
    <value>hadoop</value>
    <description>
      The POSIX group of the NodeManager. It should match the setting in
      "container-executor.cfg". This configuration is required for validating
      the secure access of the container-executor binary.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.linux-container-executor.nonsecure-mode.limit-users</name>
    <value>false</value>
    <description>
      Whether all applications should be run as the NodeManager process' owner.
      When false, applications are launched instead as the application owner.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.allowed-runtimes</name>
    <value>default,docker</value>
    <description>
      Comma separated list of runtimes that are allowed when using
      LinuxContainerExecutor. The allowed values are default, docker, and
      javasandbox.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.type</name>
    <value></value>
    <description>
      Optional. Sets the default container runtime to use.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.image-name</name>
    <value></value>
    <description>
      Optional. Default docker image to be used when the docker runtime is
      selected.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.image-update</name>
    <value>false</value>
    <description>
      Optional. Default option to decide whether to pull the latest image
      or not.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
    <value>host,none,bridge</value>
    <description>
      Optional. A comma-separated set of networks allowed when launching
      containers. Valid values are determined by Docker networks available from
      `docker network ls`
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.default-container-network</name>
    <value>host</value>
    <description>
      The network used when launching Docker containers when no
      network is specified in the request. This network must be one of the
      (configurable) set of allowed container networks.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.host-pid-namespace.allowed</name>
    <value>false</value>
    <description>
      Optional. Whether containers are allowed to use the host PID namespace.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.allowed</name>
    <value>false</value>
    <description>
      Optional. Whether applications are allowed to run in privileged
      containers. Privileged containers are granted the complete set of
      capabilities and are not subject to the limitations imposed by the device
      cgroup controller. In other words, privileged containers can do almost
      everything that the host can do. Use with extreme care.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.delayed-removal.allowed</name>
    <value>false</value>
    <description>
      Optional. Whether or not users are allowed to request that Docker
      containers honor the debug deletion delay. This is useful for
      troubleshooting Docker container related launch failures.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.stop.grace-period</name>
    <value>10</value>
    <description>
      Optional. A configurable value to pass to the Docker Stop command. This
      value defines the number of seconds between the docker stop command sending
      a SIGTERM and a SIGKILL.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.privileged-containers.acl</name>
    <value></value>
    <description>
      Optional. A comma-separated list of users who are allowed to request
      privileged containers if privileged containers are allowed.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.capabilities</name>
    <value>CHOWN,DAC_OVERRIDE,FSETID,FOWNER,MKNOD,NET_RAW,SETGID,SETUID,SETFCAP,SETPCAP,NET_BIND_SERVICE,SYS_CHROOT,KILL,AUDIT_WRITE</value>
    <description>
      Optional. This configuration setting determines the capabilities
      assigned to docker containers when they are launched. While these may not
      be case-sensitive from a docker perspective, it is best to keep these
      uppercase. To run without any capabilities, set this value to
      "none" or "NONE".
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.enable-userremapping.allowed</name>
    <value>true</value>
    <description>
      Optional. Whether docker containers are run with the UID and GID of the
      calling user.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.userremapping-uid-threshold</name>
    <value>1</value>
    <description>
      Optional. The minimum acceptable UID for a remapped user. Users with UIDs
      lower than this value will not be allowed to launch containers when user
      remapping is enabled.
    </description>
  </property>

  <property>
    <name>yarn.nodemanager.runtime.linux.docker.userremapping-gid-threshold</name>
    <value>1</value>
    <description>
      Optional. The minimum acceptable GID for a remapped user. Users belonging
      to any group with a GID lower than this value will not be allowed to
      launch containers when user remapping is enabled.
    </description>
  </property>

</configuration>
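The description of default-container-network above states that the default network must be one of the allowed networks. A minimal Python sketch of that check follows, assuming the usual Hadoop <configuration>/<property> layout; the function names and the SAMPLE text are hypothetical and are not part of Hadoop.

```python
import xml.etree.ElementTree as ET

PREFIX = "yarn.nodemanager.runtime.linux.docker."


def load_properties(xml_text):
    """Return a {name: value} dict from Hadoop-style configuration XML."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        if name is not None:
            # Normalise an empty or missing <value> to an empty string.
            props[name.strip()] = (prop.findtext("value") or "").strip()
    return props


def default_network_is_allowed(props):
    """Check that the default container network is in the allowed set."""
    raw = props.get(PREFIX + "allowed-container-networks", "")
    allowed = {n.strip() for n in raw.split(",") if n.strip()}
    return props.get(PREFIX + "default-container-network", "") in allowed


# Illustrative input only; it mirrors two of the properties listed above.
SAMPLE = """
<configuration>
  <property>
    <name>yarn.nodemanager.runtime.linux.docker.allowed-container-networks</name>
    <value>host,none,bridge</value>
  </property>
  <property>
    <name>yarn.nodemanager.runtime.linux.docker.default-container-network</name>
    <value>host</value>
  </property>
</configuration>
"""

if __name__ == "__main__":
    print(default_network_is_allowed(load_properties(SAMPLE)))
```

Running such a check before restarting the NodeManager is one way to catch a mismatch early.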

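In addition, a container-executor.cfg file must exist. It must be owned by root, have permissions 0400, and use the Java Properties format; it holds the settings of the container executor.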
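The ownership and mode requirement can be verified with a few lines of Python. This is only a sketch: check_cfg_permissions is a hypothetical helper, and the demo runs against a temporary file rather than a real container-executor.cfg.

```python
import os
import stat
import tempfile


def check_cfg_permissions(path):
    """Return a list of problems with the ownership and mode of `path`."""
    st = os.stat(path)
    problems = []
    if st.st_uid != 0:
        problems.append("owner uid is %d, expected 0 (root)" % st.st_uid)
    mode = stat.S_IMODE(st.st_mode)
    if mode != 0o400:
        problems.append("mode is %04o, expected 0400" % mode)
    return problems


if __name__ == "__main__":
    # Demo on a throwaway file; point it at the real file on a cluster node.
    fd, demo_path = tempfile.mkstemp()
    os.close(fd)
    os.chmod(demo_path, 0o644)
    print(check_cfg_permissions(demo_path))
    os.remove(demo_path)
```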

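Commonly used settings:

Enabling Docker support:

[Figure: screenshot of the settings table, image not preserved]

Container properties:

[Figure: screenshot of the settings table, image not preserved]

Note: to access the YARN local directories, list them in docker.allowed.rw-mounts.

Optional properties:

[Figure: screenshot of the settings table, image not preserved]

Example settings: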

yarn.nodemanager.linux-container-executor.group=yarn
[docker]
  module.enabled=true
  docker.privileged-containers.enabled=true
  docker.privileged-containers.registries=local
  docker.trusted.registries=centos
  docker.allowed.capabilities=SYS_CHROOT,MKNOD,SETFCAP,SETPCAP,FSETID,CHOWN,AUDIT_WRITE,SETGID,NET_RAW,FOWNER,SETUID,DAC_OVERRIDE,KILL,NET_BIND_SERVICE
  docker.allowed.networks=bridge,host,none
  docker.allowed.ro-mounts=/sys/fs/cgroup
  docker.allowed.rw-mounts=/var/hadoop/yarn/local-dir,/var/hadoop/yarn/log-dir
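To make the layout of the file concrete, here is a small Python sketch that reads key=value lines grouped under optional [section] headers. The parser is a hypothetical illustration and not the one used by container-executor; SAMPLE repeats a few lines of the example above.

```python
def parse_executor_cfg(text):
    """Parse 'key=value' lines grouped under optional '[section]' headers."""
    sections = {"": {}}  # keys that appear before any header go under ""
    current = ""
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("[") and line.endswith("]"):
            current = line[1:-1]
            sections.setdefault(current, {})
        elif "=" in line:
            key, value = line.split("=", 1)
            sections[current][key.strip()] = value.strip()
    return sections


SAMPLE = """\
yarn.nodemanager.linux-container-executor.group=yarn
[docker]
  module.enabled=true
  docker.allowed.networks=bridge,host,none
  docker.allowed.ro-mounts=/sys/fs/cgroup
  docker.allowed.rw-mounts=/var/hadoop/yarn/local-dir,/var/hadoop/yarn/log-dir
"""

if __name__ == "__main__":
    for section, values in parse_executor_cfg(SAMPLE).items():
        print(repr(section), values)
```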

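3 Docker Image Requirements

(1) Users

The container is launched explicitly with the application owner as the container user. If the application owner is not a valid user in the image, or its UID differs, the launch fails. See User Management in Docker Container.

(2) Dependencies

The image must contain everything the application needs, such as the runtime environment and environment variables, in compatible versions.

YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE affects how the launch command is run. If the image defines an ENTRYPOINT and the variable is set to true, the command is passed to the ENTRYPOINT.

If a large image is not already in the Docker cache, launching a container triggers a pull that may time out, so pull such images in advance.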

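4 CGroups Configuration Requirements

Docker uses cgroups to limit the resource usage of individual containers. Because the launched containers belong to YARN, the --cgroup-parent option is used to place them in the appropriate control group.

Docker supports two cgroup drivers, cgroupfs and systemd. Only cgroupfs is supported here; launching a container with the systemd driver fails with an error.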

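5 Application Submission

[Figure: screenshot of the table of environment variables used at submission, image not preserved]

Note: the first two entries are required.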

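6 Using Docker Bind Mounted Volumes

Note: bind-mounting system directories is not recommended, because it may leak information.

Setup:

The administrator defines a whitelist of directories with docker.allowed.ro-mounts and docker.allowed.rw-mounts; each entry also covers its subdirectories.
The application submitter requests mounts with YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS; only directories covered by the administrator's whitelist are accepted.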

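User mounts are written as source:destination(:mode). source is the path on the host, destination is the path inside the container, and mode is the access mode, ro or rw, which defaults to rw. The mode may also carry a bind propagation option: shared, rshared, slave, rslave, private, or rprivate. Example: /sys/fs/cgroup:/sys/fs/cgroup:ro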
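The mount format and the whitelist rule can be sketched in Python as follows. All function names are hypothetical. Two points are assumptions rather than statements from this text: that a propagation option is combined with the access mode as ro+option or rw+option, and that a read-only request may be satisfied by either whitelist.

```python
import posixpath

PROPAGATION = {"shared", "rshared", "slave", "rslave", "private", "rprivate"}


def parse_mount(spec):
    """Split 'source:destination[:mode]' into (source, destination, mode)."""
    parts = spec.split(":")
    if len(parts) == 2:
        source, destination = parts
        mode = "rw"  # default when no mode is given
    elif len(parts) == 3:
        source, destination, mode = parts
    else:
        raise ValueError("expected source:destination[:mode], got %r" % spec)
    # Assumption: mode is 'ro', 'rw', an option, or 'ro+option'/'rw+option'.
    for token in mode.split("+"):
        if token not in PROPAGATION | {"ro", "rw"}:
            raise ValueError("unknown mode %r in %r" % (token, spec))
    return source, destination, mode


def is_under(path, allowed_dirs):
    """True if `path` equals, or is a child of, one of `allowed_dirs`."""
    path = posixpath.normpath(path)
    for allowed in allowed_dirs:
        allowed = posixpath.normpath(allowed)
        if path == allowed or path.startswith(allowed.rstrip("/") + "/"):
            return True
    return False


def mount_is_permitted(spec, ro_mounts, rw_mounts):
    """Check one requested mount against the administrator's whitelists."""
    source, _, mode = parse_mount(spec)
    if "ro" in mode.split("+"):
        # Assumption: a read-only request may match either whitelist.
        return is_under(source, ro_mounts) or is_under(source, rw_mounts)
    return is_under(source, rw_mounts)


if __name__ == "__main__":
    print(mount_is_permitted("/sys/fs/cgroup:/sys/fs/cgroup:ro",
                             ["/sys/fs/cgroup"], []))
```

Comparing whole path components, rather than raw string prefixes, keeps /var/hadoop/yarn/local-dirx from slipping through a whitelist entry for /var/hadoop/yarn/local-dir.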

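7 User Management in Docker Container

Users are identified by the same uid:gid pair that YARN uses, so the pair must be consistent between the host and the container.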

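In non-secure mode, processes run as the user nobody; see Using CGroups with YARN. On CentOS this user and its group both have ID 99. If the corresponding user inside the container does not have ID 99, the launch fails or behaves unpredictably.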
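A quick way to spot such a mismatch is to compare the passwd entry of the user on the host with the one shipped in the image. The Python sketch below works on passwd-formatted text; the helper names and the two sample entries are illustrative assumptions.

```python
def passwd_ids(passwd_text, user):
    """Return (uid, gid) for `user` from passwd-formatted text, or None."""
    for line in passwd_text.splitlines():
        fields = line.split(":")
        # passwd format: name:password:uid:gid:gecos:home:shell
        if len(fields) >= 4 and fields[0] == user:
            return int(fields[2]), int(fields[3])
    return None


def ids_match(host_passwd, image_passwd, user):
    """True if `user` exists on the host and has the same ids in the image."""
    host = passwd_ids(host_passwd, user)
    return host is not None and host == passwd_ids(image_passwd, user)


# Illustrative entries only, not taken from a real system.
HOST = "root:x:0:0:root:/root:/bin/bash\nnobody:x:99:99:Nobody:/:/sbin/nologin\n"
IMAGE = "nobody:x:65534:65534:nobody:/nonexistent:/usr/sbin/nologin\n"

if __name__ == "__main__":
    print(passwd_ids(HOST, "nobody"), passwd_ids(IMAGE, "nobody"))
    print(ids_match(HOST, IMAGE, "nobody"))
```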

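Privileged containers are an exception: they are launched without the uid:gid pair being set and honor the USER or GROUP entries of the Dockerfile, so the pair does not have to match.

By default, Docker resolves users against /etc/passwd (and /etc/shadow) inside the container, which is unlikely to contain the right entries. Use one of the following approaches instead:

(1) Static user management

Suitable for non-secure mode with a single known user, for example in test environments.

Change the UID and GID manually:

usermod -u 99 nobody
groupmod -g 99 nobody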

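(2) Bind mounting

Bind-mount the host's user configuration files /etc/passwd and /etc/group into the container. Both files must be listed in docker.allowed.ro-mounts in container-executor.cfg, and the application must include /etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro in YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS when it is submitted.

Limitations:

The mounted files completely replace the users and groups defined in the image.
The files are read once at launch; users and groups cannot be changed while the container is running.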

(3) SSSD

The System Security Services Daemon (SSSD) provides access to identity and authentication providers such as LDAP and Active Directory.

# Typical Linux authentication flow
application -> libpam -> pam_authenticate -> pam_unix.so -> /etc/passwd
# With SSSD
application -> libpam -> pam_authenticate -> pam_sss.so -> SSSD -> pam_unix.so -> /etc/passwd

Steps:

1) Host

# Install the packages
yum -y install sssd-common sssd-proxy

# Create a PAM service for the container
cat /etc/pam.d/sss_proxy


auth required pam_unix.so
account required pam_unix.so
password required pam_unix.so
session required pam_unix.so

# Create the configuration file /etc/sssd/sssd.conf, owned by root:root with permissions 0600
cat /etc/sssd/sssd.conf



[sssd]
services = nss,pam
config_file_version = 2
domains = proxy
[nss]
[pam]
[domain/proxy]
id_provider = proxy
proxy_lib_name = files
proxy_pam_target = sss_proxy

# Start the service
systemctl start sssd

# Verify that the user localuser is resolved through sss
getent passwd -s sss localuser

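2) Container

Bind-mount /var/lib/sss/pipes from the host into the container, because the SSSD UNIX sockets are located there.

-v /var/lib/sss/pipes:/var/lib/sss/pipes:rw

Configuration:

# Install the client
yum -y install sssd-client

# Make sure sss is configured for the passwd and group databases in /etc/nsswitch.conf

# Configure the PAM service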
cat /etc/pam.d/system-auth



#%PAM-1.0
# This file is auto-generated.
# User changes will be destroyed the next time authconfig is run.
auth        required      pam_env.so
auth        sufficient    pam_unix.so try_first_pass nullok
auth        sufficient    pam_sss.so forward_pass
auth        required      pam_deny.so

account     required      pam_unix.so
account     [default=bad success=ok user_unknown=ignore] pam_sss.so
account     required      pam_permit.so

password    requisite     pam_pwquality.so try_first_pass local_users_only retry=3 authtok_type=
password    sufficient    pam_unix.so try_first_pass use_authtok nullok sha512 shadow
password    sufficient    pam_sss.so use_authtok
password    required      pam_deny.so

session     optional      pam_keyinit.so revoke
session     required      pam_limits.so
-session     optional      pam_systemd.so
session     [success=1 default=ignore] pam_succeed_if.so service in crond quiet use_uid
session     required      pam_unix.so
session     optional      pam_sss.so

# Save the image and use it as the base image for applications

# Test the image in the YARN environment
id



uid=5000(localuser) gid=5000(localuser) groups=5000(localuser),1337(hadoop)

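8 Privileged Container Security Considerations

Privileged containers are disabled by default. Privileged mode is available only for ENTRYPOINT-enabled images, and only when docker.privileged-containers.enabled is turned on. To protect the host, such an image can run with root privileges inside the container, but access to host-level devices is disabled.

Administrators can define trusted registries. In the following example, library, the name used for official Docker Hub images, is trusted: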

[docker]
  docker.privileged-containers.enabled=true
  docker.trusted.registries=library
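One way to picture registry trust is as a prefix test on the image name. The Python sketch below assumes that an image counts as trusted when its name starts with a trusted registry name followed by a slash; that rule and both function names are assumptions made for illustration, and the actual matching in container-executor may differ.

```python
def parse_registries(value):
    """Split a comma-separated docker.trusted.registries value."""
    return [r.strip() for r in value.split(",") if r.strip()]


def image_is_trusted(image, trusted_registries):
    """True if `image` starts with '<registry>/' for a trusted registry."""
    for registry in trusted_registries:
        prefix = registry if registry.endswith("/") else registry + "/"
        if image.startswith(prefix):
            return True
    return False


if __name__ == "__main__":
    trusted = parse_registries("library")
    for name in ("library/openjdk:8", "centos:latest"):
        print(name, image_is_trusted(name, trusted))
```

Appending the slash before comparing prevents a registry named library from matching an unrelated prefix such as librarian/.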

For finer-grained control, docker.privileged-containers.registries restricts which images may run privileged. If it is not set, docker.trusted.registries is used instead.

[docker]
  docker.privileged-containers.enabled=true
  docker.privileged-containers.registries=local/centos:latest
  docker.trusted.registries=library

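Tagging a local image with a repository prefix made up of a hostname and port distinguishes it from remote images:

docker tag centos:latest localhost:5000/centos:latest

Example:

# Tag the local image with the local prefix
docker tag centos:latest local/centos:latest

# Then add local to docker.trusted.registries

Trusted images may mount external resources, such as HDFS through the NFS gateway or host-level Hadoop configuration.

See the YARN Service HTTPD example.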

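9 Container Reacquisition Requirements

On restart, the NodeManager determines whether a container is still alive by checking that the container's PID is present in the /proc filesystem.

For security reasons, the operating system administrator may enable the hidepid mount option on /proc. In that case the group of the YARN user must be whitelisted with the gid option, as in the following entry; otherwise reacquisition fails after a restart:

proc /proc proc nosuid,nodev,noexec,hidepid=2,gid=yarn 0 0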
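The liveness test amounts to checking for a PID directory under /proc. A minimal Python sketch is shown below, with a hypothetical function name and a proc_root parameter added so the demo can use a throwaway directory in place of /proc. With hidepid=2 and no gid whitelist, the PID directories of other users are hidden, so a check of this kind would report their containers as gone.

```python
import os
import tempfile


def container_is_alive(pid, proc_root="/proc"):
    """Return True if a directory for `pid` is visible under `proc_root`."""
    return os.path.isdir(os.path.join(proc_root, str(pid)))


if __name__ == "__main__":
    # Demo against a temporary directory standing in for /proc.
    with tempfile.TemporaryDirectory() as fake_proc:
        os.mkdir(os.path.join(fake_proc, "4242"))
        print(container_is_alive(4242, fake_proc))
        print(container_is_alive(4243, fake_proc))
```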

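10 Connecting to a Secure Docker Repository

The Docker client draws its configuration from the default location, $HOME/.docker/config.json on the NodeManager host. Because the Docker configuration is where credentials for secure repositories are stored, using the LCE with secure Docker repositories in this way is discouraged.

YARN-5428 added support to Distributed Shell for supplying the Docker client configuration securely.

As a workaround, the Docker daemon on every NodeManager host can be logged in to the secure repository manually with the docker login command: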

docker login [OPTIONS] [SERVER]

Register or log in to a Docker registry server, if no server is specified
"https://index.docker.io/v1/" is the default.

-e, --email=""       Email
-p, --password=""    Password
-u, --username=""    Username

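Note: this approach means that all users have access to the secure repository.

Hadoop integrates with Docker Trusted Registry through the YARN service API. The Docker Registry can store images on HDFS, on S3, or on external storage using a CSI driver.

(1) HDFS

The NFS Gateway makes it possible to mount HDFS as an NFS mount point.

The Docker Registry can then be configured to write to HDFS through the standard file system API.

Settings in hdfs-site.xml: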

<property>
  <name>nfs.exports.allowed.hosts</name>
  <value>* rw</value>
</property>

<property>
  <name>nfs.file.dump.dir</name>
  <value>/tmp/.hdfs-nfs</value>
</property>

<property>
  <name>nfs.kerberos.principal</name>
  <value>nfs/_HOST@EXAMPLE.COM</value>
</property>

<property>
  <name>nfs.keytab.file</name>
  <value>/etc/security/keytabs/nfs.service.keytab</value>
</property>

On every DataNode:

# Run the NFS Gateway as the hdfs user
$HADOOP_HOME/bin/hdfs --daemon start nfs3

# Mount the NFS export at /hdfs, where $DN_IP is the IP address of the DataNode
mount -t nfs -o vers=3,proto=tcp,nolock,noacl,sync $DN_IP:/ /hdfs

Configure container-executor.cfg to allow the trusted images:

[docker]
docker.privileged-containers.enabled=true
docker.trusted.registries=library,registry.docker-registry.registry.example.com:5000
docker.allowed.rw-mounts=/tmp,/usr/local/hadoop/logs,/hdfs

The Docker Registry can be started as a YARN service described in registry.json:

{
  "name": "docker-registry",
  "version": "1.0",
  "kerberos_principal" : {
    "principal_name" : "registry/_HOST@EXAMPLE.COM",
    "keytab" : "file:///etc/security/keytabs/registry.service.keytab"
  },
  "components" :
  [
    {
      "name": "registry",
      "number_of_containers": 1,
      "artifact": {
        "id": "registry:latest",
        "type": "DOCKER"
      },
      "resource": {
        "cpus": 1,
        "memory": "256"
      },
      "run_privileged_container": true,
      "configuration": {
        "env": {
          "YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE":"true",
          "YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS":"/hdfs/apps/docker/registry:/var/lib/registry"
        },
        "properties": {
          "docker.network": "host"
        }
      }
    }
  ]
}

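Launch the service:

yarn app -launch docker-registry /tmp/registry.json

The registry is reached under a name that follows the Hadoop Registry DNS format:

registry.docker-registry.$USER.$DOMAIN:5000

Once the registry application reaches the STABLE state, users can push or pull images by prefixing the image name with registry.docker-registry.registry.example.com:5000/.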
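The address is assembled from four labels and a port. The Python sketch below assumes, from registry.json, that the first label is the component name and the second is the service name; both helper functions are hypothetical.

```python
def service_address(component, service, user, domain, port):
    """Build '<component>.<service>.<user>.<domain>:<port>'."""
    return "%s.%s.%s.%s:%d" % (component, service, user, domain, port)


def qualified_image(address, image):
    """Prefix an image name with the registry address."""
    return address + "/" + image


if __name__ == "__main__":
    address = service_address("registry", "docker-registry",
                              "registry", "example.com", 5000)
    print(qualified_image(address, "centos:latest"))
```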

(2) S3

Omitted.

11 Example Configurations

The examples assume that Hadoop is installed in /usr/local/hadoop and that docker.allowed.ro-mounts in container-executor.cfg already contains /usr/local/hadoop,/etc/passwd,/etc/group.

(1) MapReduce: submitting the pi job

HADOOP_HOME=/usr/local/hadoop
YARN_EXAMPLES_JAR=$HADOOP_HOME/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar
MOUNTS="$HADOOP_HOME:$HADOOP_HOME:ro,/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro"
IMAGE_ID="library/openjdk:8"

export YARN_CONTAINER_RUNTIME_TYPE=docker
export YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$IMAGE_ID
export YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=$MOUNTS

yarn jar $YARN_EXAMPLES_JAR pi \
  -Dmapreduce.map.env.YARN_CONTAINER_RUNTIME_TYPE=docker \
  -Dmapreduce.map.env.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=$MOUNTS \
  -Dmapreduce.map.env.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$IMAGE_ID \
  -Dmapreduce.reduce.env.YARN_CONTAINER_RUNTIME_TYPE=docker \
  -Dmapreduce.reduce.env.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=$MOUNTS \
  -Dmapreduce.reduce.env.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$IMAGE_ID \
  1 40000

Note: the application master, map tasks, and reduce tasks are configured independently.
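Because the map and reduce settings repeat the same three variables, the -D options can be generated instead of typed by hand. The Python sketch below reproduces the six options of the command above; docker_env_flags is a hypothetical helper, not part of Hadoop.

```python
def docker_env_flags(image, mounts, tasks=("map", "reduce")):
    """Build the -Dmapreduce.<task>.env.* options for each task type."""
    settings = [
        ("YARN_CONTAINER_RUNTIME_TYPE", "docker"),
        ("YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS", mounts),
        ("YARN_CONTAINER_RUNTIME_DOCKER_IMAGE", image),
    ]
    return ["-Dmapreduce.%s.env.%s=%s" % (task, key, value)
            for task in tasks
            for key, value in settings]


if __name__ == "__main__":
    mounts = "/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro"
    for flag in docker_env_flags("library/openjdk:8", mounts):
        print(flag)
```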

(2) Spark: running the Spark shell in containers

This example assumes that Spark is installed in /usr/local/spark.

HADOOP_HOME=/usr/local/hadoop
SPARK_HOME=/usr/local/spark
MOUNTS="$HADOOP_HOME:$HADOOP_HOME:ro,/etc/passwd:/etc/passwd:ro,/etc/group:/etc/group:ro"
IMAGE_ID="library/openjdk:8"

$SPARK_HOME/bin/spark-shell --master yarn \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$IMAGE_ID \
  --conf spark.yarn.appMasterEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=$MOUNTS \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_TYPE=docker \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_IMAGE=$IMAGE_ID \
  --conf spark.executorEnv.YARN_CONTAINER_RUNTIME_DOCKER_MOUNTS=$MOUNTS

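(3) ENTRYPOINT support

Docker support was introduced in Hadoop 2.x. It was designed to run existing Hadoop programs inside Docker containers, with log redirection and environment setup integrated into the NodeManager.

Hadoop 3.x additionally supports the Docker-native form, which uses the image's ENTRYPOINT.

By setting YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE, an application chooses between YARN mode and Docker mode.

Add the variable to the environment whitelist in yarn-site.xml: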

<property>
        <name>yarn.nodemanager.env-whitelist</name>
        <value>JAVA_HOME,HADOOP_COMMON_HOME,HADOOP_HDFS_HOME,HADOOP_CONF_DIR,HADOOP_YARN_HOME,HADOOP_MAPRED_HOME,YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE</value>
</property>

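Set it in yarn-env.sh:

export YARN_CONTAINER_RUNTIME_DOCKER_RUN_OVERRIDE_DISABLE=true

(4) Requirements when not using ENTRYPOINT (YARN mode)

1) /bin/bash

Must be available inside the container. Minimal images may not have bash installed.

2) find

Must be available inside the container.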

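12 YARN SysFS Support

YARN SysFS is a pseudo file system provided by the YARN framework that exports cluster information. The information is exported under /hadoop/yarn/sysfs.

It allows developers to obtain cluster information through the NodeManager REST API without depending on external services.

13 Docker Container Service Mode

Service mode runs the container without setting the user and group.

It is disabled by default. The administrator enables it by setting docker.service-mode.enabled in container-executor.cfg.

A partial example configuration: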

yarn.nodemanager.linux-container-executor.group=yarn
[docker]
  module.enabled=true
  docker.privileged-containers.enabled=true
  docker.service-mode.enabled=true

An application controls service mode by setting the YARN_CONTAINER_RUNTIME_DOCKER_SERVICE_MODE environment variable.